Abstract



The block-sorting text compression algorithm can be viewed as associating a
context with each character to be compressed, and then sorting the characters
on their contexts. Normally, the context associated with each character is the
string to the left of the character. Recently, Ratushnyak suggested that it
might be possible instead to build a context by interleaving characters taken
alternately from the left and right. We show that transformations of this type
are not reversible in general unless additional information is supplied.
Further, the amount of additional information needed to reverse the
transformation is necessarily large, and so such transformations are unlikely
to be of interest as part of a compression algorithm.
1 Introduction



The first step of the block-sorting compression algorithm [1] transforms a string 
by sorting its characters on their left-contexts. (The left-context of character i is
the string composed of characters i - 1, i - 2, i - 3, ..., treating the string
as cyclic.) The transformation is interesting because it is reversible and because it
yields a result that is amenable to compression by straightforward algorithms. 
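
The left-context sort described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation from [1]: the function names are hypothetical, contexts are compared lexicographically over the whole cyclic string, and the index that the full algorithm needs to identify the correct rotation is omitted.

```python
def left_context(s: str, i: int) -> str:
    """Characters i-1, i-2, ..., read cyclically (s[i] itself is excluded)."""
    n = len(s)
    return "".join(s[(i - k) % n] for k in range(1, n))


def left_context_sort(s: str) -> str:
    """Emit the characters of s in order of their left-contexts."""
    # sorted() is stable, so equal contexts keep their original order
    # (an assumption of this sketch, not something the text specifies).
    order = sorted(range(len(s)), key=lambda i: left_context(s, i))
    return "".join(s[i] for i in order)


print(left_context_sort("banana"))  # nnbaaa
```

Characters that share similar left-contexts end up adjacent in the output, which is what makes the result easy to compress.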

Recently, A. Ratushnyak suggested a possible variant of the block-sorting
compression algorithm that would sort the characters on contexts composed of an
interleaving of the left- and right-contexts [2]. In his variant, the context for
character i consists of the string composed of characters i - 1, i + 1, i - 2,
i + 2, i - 3, i + 3, etc. Context information from both sides is likely to be a
better predictor for a character than context information from one side, and
indeed this suggestion results in significantly better compression of the result
of the transformation.
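
As a concrete reading of this definition, the sketch below builds the interleaved context and sorts on it. Several details are assumptions made for illustration and are not taken from [2]: the function names, the choice to stop after n // 2 pairs (so for even n the opposite character appears twice), lexicographic ordering with '0' < '1', and stable tie-breaking.

```python
def interleaved_context(s: str, i: int) -> str:
    """Characters i-1, i+1, i-2, i+2, ... taken cyclically around position i."""
    n = len(s)
    ctx = []
    for k in range(1, n // 2 + 1):  # assumption: n // 2 pairs cover the string
        ctx.append(s[(i - k) % n])
        ctx.append(s[(i + k) % n])
    return "".join(ctx)


def interleaved_transform(s: str) -> str:
    """Emit the characters of s in order of their interleaved contexts."""
    order = sorted(range(len(s)), key=lambda i: interleaved_context(s, i))
    return "".join(s[i] for i in order)


print(interleaved_context("110100", 0))  # 010011
print(interleaved_transform("110100"))   # 110100
```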

Ratushnyak provides a worked example demonstrating how one can reverse the 
transformation on a 13-character string. The techniques he uses are interesting and 
may lead to further insights. However, he gives neither a complete algorithm for 
reversing his transformation nor a proof that any such algorithm exists. This left 
open the question of whether his suggestion could indeed form the basis of a useful 
compression algorithm. 

2 Ratushnyak's transformation is not reversible

At the suggestion of Jim Saxe, we applied Ratushnyak's algorithm to short binary
strings and we soon found collisions. That is, we found inputs that are distinct
under rotation yet which are mapped to the same string by Ratushnyak's
transformation. The shortest colliding strings have length 6. An example colliding
pair is 110100 and 101100, which both map to 110100 under Ratushnyak's
transformation. As the length of the strings increases, the number of collisions
quickly becomes large.
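
A brute-force search of this kind can be sketched as follows. The code is a reconstruction under stated assumptions, not the program we used: the transformation is taken to sort on the context i - 1, i + 1, i - 2, i + 2, ... truncated after n // 2 pairs, with lexicographic order and stable ties, and all helper names are hypothetical. Strings are grouped by their transformed output, and rotations of the same string are merged by reducing each to its smallest rotation.

```python
from itertools import product


def interleaved_transform(s: str) -> str:
    """Sort characters on the interleaved context (assumed definition)."""
    n = len(s)

    def context(i: int) -> str:
        return "".join(s[(i - k) % n] + s[(i + k) % n]
                       for k in range(1, n // 2 + 1))

    return "".join(s[i] for i in sorted(range(n), key=context))


def canonical_rotation(s: str) -> str:
    """Smallest rotation of s, used to identify strings equal under rotation."""
    return min(s[i:] + s[:i] for i in range(len(s)))


def find_collisions(n: int) -> dict:
    """Map each output to the rotation classes producing it, if more than one."""
    groups = {}
    for bits in product("01", repeat=n):
        s = "".join(bits)
        groups.setdefault(interleaved_transform(s), set()).add(canonical_rotation(s))
    return {out: sorted(g) for out, g in groups.items() if len(g) > 1}


for n in range(1, 7):
    print(n, len(find_collisions(n)))  # number of ambiguous outputs per length
```

Under these assumptions the search reports no collisions below length 6, and at length 6 the output 110100 is shared by the rotation classes of 110100 and 101100.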

3 A more general negative result 

The presence of collisions in Ratushnyak's transformation does not necessarily 
undermine its use in a compression algorithm since we may add additional bits 
to resolve ambiguity. Indeed, in the original block-sorting algorithm, we have to 
add a log n-bit string in order to distinguish the n possible rotations of the decoded
string. 

In general, we have to use at least log M bits to resolve an M-way collision. In
the following, we construct a set of $c^n$ strings of length n, where c > 1, all of which
are transformed to the same string under Ratushnyak's algorithm. In other words,
we need linearly many additional bits, n log c of them, to resolve the ambiguity.

For an even number n > 0, consider a string s with the following properties: 

• it consists of $\frac{3}{2}n$ 0's and n 1's;

• it starts with 0 and ends with 1;

• it does not contain two consecutive 1's; and

• it does not contain three consecutive 0's.

Now, we claim that: 

1. Any such string s is transformed to the string $1 \cdots 1 0 \cdots 0$.

For each 0 in s, its context starts with either 10 (when it appears first in two
consecutive 0's), 01 (when it appears second in two consecutive 0's), or 11
(when it is sandwiched by two 1's). For each 1, since it is always between
two 0's (the final 1 is followed cyclically by the initial 0), its context always
starts with 00. Thus, all the 1's have contexts that sort before the 0's.
Therefore, s is transformed to the string $1 \cdots 1 0 \cdots 0$.

2. The number of such strings is $\Theta(2^n/\sqrt{n})$.

We can construct such strings as follows. Consider a permutation of n/2
a's and n/2 b's. We first insert a 1 after each letter in the string. Then, we
replace each a with 0 and each b with 00. It is easy to verify that strings
constructed this way satisfy the required constraints. The number of such
permutations is $\binom{n}{n/2} = \Theta(2^n/\sqrt{n})$.
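
Both claims can be checked mechanically for small n. The sketch below enumerates the strings produced by the a/b construction and confirms that each one is transformed to n 1's followed by 3n/2 0's. It assumes the same reading of the transformation as elsewhere in this note (context i - 1, i + 1, i - 2, i + 2, ..., lexicographic order with '0' < '1'); the function names are hypothetical.

```python
from itertools import combinations
from math import comb


def interleaved_transform(s: str) -> str:
    """Sort characters on the interleaved context (assumed definition)."""
    n = len(s)

    def context(i: int) -> str:
        return "".join(s[(i - k) % n] + s[(i + k) % n]
                       for k in range(1, n // 2 + 1))

    return "".join(s[i] for i in sorted(range(n), key=context))


def build_strings(n: int):
    """Yield every string from the construction: a -> 0, b -> 00, each followed by 1."""
    for b_positions in combinations(range(n), n // 2):  # where the b's go
        b_set = set(b_positions)
        yield "".join(("00" if j in b_set else "0") + "1" for j in range(n))


n = 4
strings = list(build_strings(n))
target = "1" * n + "0" * (3 * n // 2)

print(len(strings), comb(n, n // 2))                             # 6 6
print(all(interleaved_transform(s) == target for s in strings))  # True
```

Some of the constructed strings are rotations of one another; identifying them removes at most a factor of 5n/2 from the count, which does not change the linear growth in the number of bits needed.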

We thus have constructed $\Theta(2^n/\sqrt{n})$ strings of length $\frac{5}{2}n$ that are all
transformed to the same string $1 \cdots 1 0 \cdots 0$ under the algorithm where the context
of each character starts with its two neighboring characters. Equivalently, we
need about 2/5 = 0.4 additional bits, per bit in the original string, to resolve
the ambiguity. A similar, but more complex construction shows that the number of
additional bits required can be as high as 0.4057.
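
The 0.4 figure follows from Stirling's approximation to the central binomial coefficient. The short derivation below is a sketch; logarithms are taken base 2, an assumption consistent with counting bits.

```latex
\binom{n}{n/2} \sim \frac{2^{n}}{\sqrt{\pi n/2}}
\quad\Longrightarrow\quad
\log_2 \binom{n}{n/2} = n - \tfrac{1}{2}\log_2\frac{\pi n}{2} + o(1),
\qquad
\frac{\log_2 \binom{n}{n/2}}{\tfrac{5}{2}\,n} \;\longrightarrow\; \frac{2}{5} = 0.4
\quad (n \to \infty).
```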

Notice that this argument applies not only to Ratushnyak's transformation but 
also to any similar transformation in which the context of a character starts with its 
two neighboring characters, in either order. 